NAR Genomics and Bioinformatics
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match the content profile of NAR Genomics and Bioinformatics, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Wang, Z.; Arsuaga, J.
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction. Author summary: Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA--without ever being shown which phages infect which hosts--could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
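The retrieval scheme described above (ranking candidate hosts by cosine similarity between normalized whole-genome embeddings, then reciprocal rank fusion across methods) can be sketched as follows. This is an illustrative reduction, not the authors' code; the fusion constant k=60 is a common default for reciprocal rank fusion, not a value taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_hosts(phage_emb, host_embs):
    """Rank candidate host indices by cosine similarity to the phage embedding, best first."""
    sims = [cosine(phage_emb, h) for h in host_embs]
    return sorted(range(len(host_embs)), key=sims.__getitem__, reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists from several methods: each list contributes 1/(k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A host ranked near the top by several methods accumulates a larger fused score than one ranked first by a single method, which is why fusion tends to lift top-10 retrieval.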
Wolfram-Schauerte, M.; Trust, C.; Waffenschmidt, N.; Nieselt, K.
Time-resolved transcriptomic profiling has been used to study phage-host interactions for more than a decade. However, the resulting datasets are not readily accessible for custom re-analysis, and resources are lacking that provide standardized processing, storage, and analysis of transcriptomes from phage infections. Here, we present the PhageExpressionAtlas, the first bioinformatics resource for storing time-resolved dual RNA-sequencing data from phage infections. This data was processed uniformly using a custom analysis pipeline and is presented for interactive exploration through visualisation. The PhageExpressionAtlas currently hosts 42 datasets from 23 studies. Using the PhageExpressionAtlas, we replicate key findings from original publications and extend hypothesis testing across multiple phage-host systems. By systematically querying and analyzing the underlying database, we evaluate approaches to phage gene classification and show that uncharacterized phage genes are expressed across all infection phases. Moreover, we provide a comprehensive view of the expression dynamics of anti-phage defenses as well as host- and phage-encoded anti-defense systems in the infection context, indicating unique and conserved patterns of transcriptional regulation underlying bacterial anti-phage immunity and phage counter-strategies. Together, the PhageExpressionAtlas is a unifying resource that democratizes transcriptomics-driven analyses of phage-host interactions and supports integrative cross-study assessment.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome-wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's Disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic (DA) neurons as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related, but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single cell analysis of an individual cell type and describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
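The two repeatability metrics quoted above differ in how they weight categories: mean category Jaccard averages per-category overlap, while micro-Jaccard pools all accessions before comparing. A minimal sketch, with hypothetical category names and accessions:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def repeatability(run1, run2):
    """run1, run2: dicts mapping category -> set of accessions from one run each.
    Returns (mean per-category Jaccard, micro-Jaccard over the pooled sets)."""
    cats = run1.keys() | run2.keys()
    mean_j = sum(jaccard(run1.get(c, set()), run2.get(c, set())) for c in cats) / len(cats)
    pool1 = set().union(*run1.values())
    pool2 = set().union(*run2.values())
    return mean_j, jaccard(pool1, pool2)
```

Because micro-Jaccard weights large categories more heavily, a system can score well on the mean metric while pooled overlap lags, as in the DeerFlow figures (0.795 vs. 0.571).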
May, G. E.; Akirtava, C.; McManus, J.
Since the discovery of viral Internal Ribosome Entry Sites (IRESes), researchers have sought to find similar elements in mammalian host genes, termed "cellular IRESes". However, the plasmid systems used to measure cellular IRES activity are vulnerable to false positives due to promoter activity in candidate IRESes. Orthogonal methods are needed to validate putative IRESes while carefully avoiding artifacts known to cause false positives. Recently, Koch et al. proposed approaches for studying IRESes, primarily circular RNA-generating plasmids, and for validating mRNA transcripts using smFISH and qRT-PCR. Here, we demonstrate confounding variables and artifacts in each of these approaches that can lead to inappropriate conclusions about potential cellular IRES activity. We show the back-splicing circRNA plasmid creates linear mRNA artifacts associated with false-positive IRES signals. Using orthogonal, gold-standard assays validated with viral IRESes, we find putative cellular IRESes reported using the back-splicing plasmid have no IRES activity. Furthermore, we demonstrate that smFISH and qRT-PCR can misidentify nuclear non-coding RNAs as mRNAs and we validate a single-molecule sequencing assay for identifying genuine mRNA 5′ ends. Our work establishes reliable methods for robust transcript annotation and IRES studies that avoid documented artifacts arising from bicistronic and back-splicing circRNA plasmid reporters.
Pan, L.; Chen, M.; Tanik, M.
The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference--with emphasis on information-theoretic and sequence-based approaches--and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information--nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic--establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
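Per-position Shannon entropy, the first layer of the proposed framework, can be computed directly from an alignment column by column. A minimal stdlib sketch; the toy alignment is hypothetical, and gap handling is omitted:

```python
import math
from collections import Counter

def positional_entropy(seqs):
    """Per-position Shannon entropy in bits for a set of equal-length aligned
    sequences: H_i = -sum_b p_i(b) * log2 p_i(b) over observed characters b."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        n = sum(counts.values())
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(h)
    return profile
```

A perfectly conserved position scores 0 bits; a position with all four nucleotides equally represented scores 2 bits, the maximum for DNA.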
Albuja, D. S.; Maldonado, P. S.; Zambrano, P. E.; Olmos, J. R.; Vera, E. R.
Accurate fungal species identification is critical for microbial ecology, food safety, and plant pathology. However, morphological limitations and genomic complexity hinder this process. Molecular markers such as the ITS region, along with Oxford Nanopore long-read sequencing, offer a robust solution, albeit limited by error rates in homopolymeric regions and a high dependence on advanced computational resources (GPUs) to achieve high accuracy. This study benchmarks two bioinformatics workflows on a multiplexed dataset of complex fungal communities to address this technological gap: a CPU-based workflow optimized using a Bayesian machine learning engine and a GPU-accelerated workflow incorporating "super high accuracy" (SUP) models and refinement with neural networks. The results establish a scalable framework for evaluating the impact of computational architecture on final taxonomic resolution. It is demonstrated that GPU processing maximizes data retention and species-level accuracy by correcting systematic errors. Alternatively, implementing automated hyperparameter optimization in CPU environments stabilizes sequence clustering and achieves high taxonomic concordance at the genus level. This conceptual advance validates the feasibility of performing ITS metabarcoding analysis in resource-constrained infrastructures, thus providing the scientific community with a reproducible protocol that balances the need for taxonomic precision with hardware availability.
Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.
Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
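The regular-expression step described above can be illustrated with a toy per-read profile. This is a deliberate simplification of what an interruption-aware genotyper does, not STRmie-HD's implementation; in particular, the CAACAG check is only a crude proxy for the real interruption-variant logic:

```python
import re

def cag_ccg_profile(read):
    """Toy per-read repeat profile for an HTT exon 1 read: longest uninterrupted
    CAG tract (in repeat units), total CCG units, and whether the canonical
    CAACAG interruption immediately follows the CAG tract."""
    cag_runs = re.findall(r"(?:CAG)+", read)
    pure_cag = max((len(r) // 3 for r in cag_runs), default=0)
    ccg_units = sum(len(r) // 3 for r in re.findall(r"(?:CCG)+", read))
    has_caacag = re.search(r"(?:CAG)+CAACAG", read) is not None
    return pure_cag, ccg_units, has_caacag
```

Applying such a profile independently to every read is what makes it possible to see a distribution of tract lengths per sample, which is the raw material for somatic mosaicism metrics.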
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline. Its outputs closely match those of the original study with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands of large-scale genomics.
Muneeb, M.; Ascher, D.
Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
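The Friedman test used above compares tools by reducing each phenotype-fold combination to within-block ranks. A minimal sketch of the statistic, without tie handling or the p-value computation (scipy.stats.friedmanchisquare is the usual production choice); the score matrix is hypothetical:

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for a blocks x treatments score matrix
    (here: phenotype-folds x PRS tools, higher score = better). Ranks are
    assigned within each block; ties are broken arbitrarily, not averaged."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

When every block ranks the tools identically, the statistic reaches its maximum n(k-1); a large value, as reported here, indicates tool rankings far from random.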
Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.
Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, generating a best-fit synthetic reference (synref) from a pangenome graph that is genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders (ntJoin) to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC.
Rikk, L.; Ghaffarinia, A.; Leigh, N. D.
Accurate genome annotation remains challenging as assembly quality often exceeds annotation reliability. Resolving ambiguities of gene presence, absence, and orthology typically requires integrating two complementary lines of evidence: sequence homology between species and the conservation of gene order (i.e., synteny). BLAST remains the standard for homology detection, yet its raw output can be difficult to interpret. Existing tools address this challenge but operate at opposing scales. Alignment viewers provide detailed pairwise statistics without genomic context, while synteny tools offer chromosome-scale perspectives without sequence-level resolution. To fill this intermediate gap, we developed Novabrowse, an interactive BLAST results interpretation framework featuring high-resolution multi-species synteny analysis, chromosomal re-arrangement investigation, ortholog detection, and gene signal discovery. Users define a genomic region of interest in a query species and/or use custom sequences, then select one or more subject species for comparison. The pipeline retrieves query gene sequences via NCBI API integration and performs BLAST searches against each subject transcriptome or genome. Results are presented via an interactive HTML file featuring alignment statistics, chromosomal maps, coverage visualizations, ribbon plots, and distance-based clustering of high-scoring segment pairs into putative gene units. We demonstrate these capabilities by investigating Foxp3, Aire, and Rbl1, three highly conserved vertebrate genes, in the recently assembled genome of the newt Pleurodeles waltl. Foxp3 and Aire have not been described in any salamander species to date, despite availability of multiple assemblies and extensive transcriptomic datasets. Using Novabrowse, we discovered conserved loci and gene signals for both genes in P. waltl, the presence of which was subsequently confirmed via Nanopore long-read RNA sequencing. 
In contrast, Rbl1 analysis uncovered a chromosomal rearrangement at its expected locus with no gene signal detected, indicating a gene loss specific to P. waltl despite the gene's retention in the closely related axolotl (Ambystoma mexicanum). Our findings demonstrate Novabrowse's capacity for evidence-based evaluation of annotation artifacts, an essential capability as high-quality assemblies become more available for phylogenetically diverse species. Novabrowse is open source (MIT license) and freely available at: https://github.com/RegenImm-Lab/Novabrowse.
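The distance-based clustering of high-scoring segment pairs into putative gene units can be approximated by merging HSP intervals on the same subject sequence whenever their gap falls below a threshold. A sketch under that assumption; the 5 kb default is illustrative, not Novabrowse's actual setting:

```python
def cluster_hsps(hsps, max_gap=5000):
    """Single-linkage clustering of HSP intervals (start, end) on one subject
    sequence into putative gene units: consecutive HSPs whose gap is at most
    max_gap are merged into a single spanning interval."""
    clusters = []
    for start, end in sorted(hsps):
        if clusters and start - clusters[-1][1] <= max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)  # extend current unit
        else:
            clusters.append([start, end])  # start a new unit
    return [tuple(c) for c in clusters]
```

Grouping HSPs this way is what lets scattered exon-level hits be read as a single candidate locus, while an isolated distant hit stays separate.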
Quadrini, M.; Tesei, L.
The ability to access, search, and analyse large collections of RNA molecules together with their secondary structure and evolutionary context is essential for comparative and phylogeny-driven studies. Although RNA secondary structure is known to be more conserved than primary sequence, no existing resource systematically associates individual RNA molecules with curated phylogenetic classifications. Here, we introduce PhyloRNA, a curated meta-database that provides large-scale access to RNA secondary structures collected from public resources or derived from experimentally resolved 3D structures. PhyloRNA allows users to search, select, and download extensive sets of RNA molecules in multiple textual formats, each entry being explicitly linked to phylogenetic annotations derived from five curated taxonomy systems. In addition to taxonomic information, each RNA molecule is accompanied by a rich set of descriptors, including pseudoknot order, genus, and three levels of structural abstraction--Core, Core Plus, and Shape--which facilitate comparative analyses across sets of molecules. PhyloRNA is publicly available at https://bdslab.unicam.it/phylorna/ and is regularly updated to incorporate newly available data and revised taxonomic annotations.
Lisiecka, A.; Kowalewska, A.; Dojer, N.
Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of homology relation induced by a pangenome graph on the characters of represented genomic sequences and define such relations for both the VG and WGA models. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in the WGAtools package, available at https://github.com/anialisiecka/WGAtools.
Lan, W.; Wang, D.; Chen, W.; Yan, X.; Chen, Q.; Pan, S.; Pan, Y.
Motivation: tRNA-derived small RNAs (tsRNAs) have emerged as a novel class of regulatory molecules implicated in the pathogenesis of many human diseases, making them promising biomarkers and therapeutic targets. However, existing computational methods for tsRNA-disease association prediction often overlook explicit biological attributes and complex feature interactions, limiting their predictive performance. Results: We propose ERFMTDA, an enhanced rotative factorization machine framework for predicting potential tsRNA-disease associations. ERFMTDA explicitly models complex interactions among heterogeneous biological features while integrating latent structural representations derived from the global association matrix. In addition, a biologically informed negative sampling strategy based on motif-level sequence similarity is introduced to improve the reliability of negative samples. Extensive experiments demonstrate that ERFMTDA consistently outperforms eleven state-of-the-art methods. Case studies on diabetic retinopathy and hepatocellular carcinoma further confirm its ability to prioritize biologically meaningful tsRNA-disease associations. Availability and implementation: The source code and datasets of ERFMTDA are available at https://github.com/lanbiolab/ERFMTDA.
Huang, K.-l.
Quality control (QC) of high-throughput sequencing data is a critical first step in genomics analysis pipelines. FastQC has served as the de facto standard for sequencing QC for over a decade, but its Java runtime dependency introduces startup overhead, elevated memory consumption, and deployment complexity. Meanwhile, the growing adoption of long-read sequencing platforms from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has created a pressing demand for QC tools capable of handling both short and long reads. However, existing solutions require separate tools for each data type and an additional aggregation tool, such as MultiQC, to consolidate results across samples. Here we present RastQC, a unified sequencing QC tool written in Rust that combines FastQC-compatible short-read QC, long-read-specific metrics, built-in multi-sample summary, native MultiQC JSON export, and a web-based report viewer in a single 2.1 MB static binary. RastQC implements all 12 standard FastQC modules with matching algorithms, plus 3 long-read modules (Read Length N50, Quality Stratified Length, and Homopolymer Content), achieving 100% module-level concordance with FastQC across 55 out of 55 calls on five model organisms. RastQC's streaming parallel pipeline with adaptive batch sizing delivers 1.8-3.2x speedup on short-read Illumina data and 4.7-6.5x speedup on long-read ONT/PacBio data, while using 8-9x less memory on small files and comparable memory on large files. RastQC is freely available as an AI agent skill at https://github.com/Huang-lab/RastQC under the MIT license.
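Read Length N50, one of the long-read modules named above, is the standard assembly-style statistic applied to read lengths: the length L such that reads of length at least L contain at least half of all sequenced bases. A reference sketch (not RastQC's Rust implementation):

```python
def read_length_n50(lengths):
    """N50 of a collection of read lengths: walk reads from longest to
    shortest and return the length at which cumulative bases first reach
    half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Unlike the mean, N50 is dominated by long reads, which is why it is the preferred length summary for ONT/PacBio data.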
Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.
Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass-spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculation of associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS to identify high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast, Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).
Appulingam, Y.; Jammal, J.; Ali, A.; Topp, S.; NYGC ALS Consortium; Iacoangeli, A.; Pain, O.
Background: Differential expression analysis is a central tool for studying the biological processes altered in human diseases via transcriptomic signatures. However, transcriptomic datasets are systematically confounded by latent variables from two distinct sources: unmeasured technical and biological heterogeneity within the expression data, and expression differences driven by population stratification. Correction using expression-based surrogate variables (SVs) and genotype-based principal components (PCs) addresses these sources independently, yet no study has directly evaluated their combined use against either method alone within a differential expression framework. In this study we hypothesised that simultaneously including both correction layers would produce more biologically valid and reproducible results than either approach alone, and tested this in two independent RNA-seq datasets of amyotrophic lateral sclerosis (ALS) cases and controls with matching genotype data. Results: Four nested differential expression models (corrected for PCs only, SVs only, both SVs and PCs, and neither) were evaluated across the KCLBB (96 cases and 52 controls) and ALS Consortium (272 cases and 35 controls) datasets. Models were evaluated on: cross-dataset effect size concordance, cross-dataset replicability quantified by the Jaccard similarity index, and biological recall against a curated reference set of 66 known ALS genes. The combined SV+PC framework consistently outperformed simpler models across all metrics. Replicability improved nearly ten-fold compared to the non-corrected model (Jaccard index: 2.28% to 19.5%), and the combined framework exhibited a statistically significant 2.1% gain over the SV-only model. Recall of known ALS genes doubled compared with SV correction alone. Crucially, effect size stability was preserved, with the combined model expanding the shared transcriptomic signal without sacrificing consistency. 
These findings remained generally robust to the number of PCs in sensitivity analyses. Conclusions: This study found that SVs and genotype PCs address non-redundant sources of confounding, and we recommend their combined use as standard practice in differential expression analysis where matched genotype data are available. Notably, PCs capturing population structure can also be derived directly from RNA-seq data, extending the applicability of this framework to studies lacking matched genotype data. Although this analysis was restricted to ALS datasets, we expect these findings to generalise to other traits.
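Not from the preprint: a minimal sketch of the Jaccard similarity index used above to quantify cross-dataset replicability, applied to two hypothetical sets of significant genes (the gene names are illustrative, not results from the study).

```python
def jaccard_index(set_a, set_b):
    """Jaccard similarity: |intersection| / |union| of two sets,
    e.g. significant gene lists from two independent datasets."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

degs_dataset1 = {"SOD1", "TARDBP", "FUS"}
degs_dataset2 = {"SOD1", "FUS", "C9orf72", "NEK1"}
# 2 shared genes out of 5 distinct genes overall.
print(jaccard_index(degs_dataset1, degs_dataset2))  # -> 0.4
```

A jump from 2.28% to 19.5% on this index means the significant gene lists from the two cohorts went from almost disjoint to sharing roughly a fifth of their union.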
Parmigiani, L.; Peterlongo, P.
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
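Not from the preprint: a minimal sketch of the Hill numbers mentioned above, using the standard definition from ecology: the Hill number of order q is (Σ pᵢ^q)^(1/(1−q)), with the q→1 limit equal to the exponential of Shannon entropy. Here the abundances would be, for example, counts of graph nodes traversed by a given number of genomes.

```python
import math

def hill_number(abundances, q):
    """Hill number (effective diversity) of order q for a
    frequency distribution. q=0 counts categories equally
    (richness); larger q downweights rare categories."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if q == 1:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# A uniform distribution over 4 categories has diversity 4 at every order.
print(hill_number([1, 1, 1, 1], 0))  # -> 4.0
print(hill_number([1, 1, 1, 1], 2))  # -> 4.0
# A skewed distribution loses effective diversity as q grows.
print(hill_number([97, 1, 1, 1], 0))  # -> 4.0 (richness ignores skew)
```

Varying q gives a diversity profile: low orders emphasize rare nodes (e.g. sequences private to one genome), high orders emphasize the common core, which is exactly the tunable weighting of rare versus common nodes the abstract describes.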